Analysis of RNA-Seq data using self-supervised learning for vital status prediction of colorectal cancer patients

Background RNA sequencing (RNA-Seq) is a technique that utilises the capabilities of next-generation sequencing to study a cellular transcriptome i.e., to determine the amount of RNA at a given time for a given biological sample. The advancement of RNA-Seq technology has resulted in a large volume of gene expression data for analysis. Results Our computational model (built on top of TabNet) is first pretrained on an unlabelled dataset of multiple types of adenomas and adenocarcinomas and later fine-tuned on the labelled dataset, showing promising results in the context of the estimation of the vital status of colorectal cancer patients. We achieve a final cross-validated (ROC-AUC) Score of 0.88 by using multiple modalities of data. Conclusion The results of this study demonstrate that self-supervised learning methods pretrained on a vast corpus of unlabelled data outperform traditional supervised learning methods such as XGBoost, Neural Networks, and Decision Trees that have been prevalent in the tabular domain. The results of this study are further boosted by the inclusion of multiple modalities of data pertaining to the patients in question. We find that genes such as RBM3, GSPT1, MAD2L1, and others important to the computation model’s prediction task obtained through model interpretability corroborate with pathological evidence in current literature. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05347-4.


SM1. Data Procurement and Processing
The RNA-Seq data is obtained by choosing the required parameters on the Genomic Data Commons (GDC) portal, downloading the corresponding manifest file, and using the GDC Client with the manifest to download patient files [1].

SM1.1.1 Manifest File for RNA-Seq data
The manifest files for both the labelled and unlabelled RNA-Seq data were obtained directly from the GDC portal.

SM1.2.1 Manifest File for CNV Data
The manifest file for both the labelled and unlabelled CNV Data was procured from [5].
Using the GDC Client tool enabled us to download the files consisting of the focal scores by the gene for each of the 33 TCGA Projects available.

SM1.2.2 CNV Data Preprocessing
CNV Data that was procured using the manifest file has file names as aliquot IDs that need to be mapped to their respective patient submitter IDs. This is done by procuring the Aliquot ID data from the GDC Portal. A get request is sent to the GDC server at https://api.gdc.cancer.gov/cases with the following parameters -• 'fields': 'cases.patients.portions.analytes.aliquots' • 'format': 'TSV' The aliquot data is stored in a dataframe called cnv_aliquotes which looks as follows -As can be seen in the figure above, this dataframe maps each patient submitter ID to one or more aliquot IDs. Here the submitter ID is the index column. First, only submitter IDs belonging to TCGA are filtered out from the cnv_aliquotes dataframe and subsequently a submitter ID to aliquot ID mapper is created using these two dataframes. Finally, each column name in the CNV data (aliquot ID) is replaced by its corresponding submitter ID.
Next, two unnecessary columns (Gene ID, Cytoband) are removed. Additionally, the CNV data has a gene ID => submitter ID mapping of size (19729, 509) and undergoes a transpose to arrive at the correct configuration of submitter ID => gene ID mapping with a size of (509, 19729).
Next, a new column by the name of submitter_id is added to the table and made the index. The duplicate rows (same submitter ID) are removed while keeping the first occurrence.
Next, the value -1 is replaced with 2 through the dataframe as negative values are not allowed when pretraining. Finally, as stated in the paper, this CNV data undergoes further preprocessing by filtering only the top 2000 high-variance genes and subsequently performing a Chi-Squared analysis on the data. Chi-squared analysis was performed using k = {128, 256} and the best results were produced with k = 256.

Clinical Data Procurement
The clinical data was procured by the R script provided in [5], using the TCGABioLinks R Package. First, the GDC portal was queried for all TCGA project IDs, using which clinical data for each of the 33 TCGA Projects were collected and merged into a single table. This was written onto a TSV file for preprocessing which is entailed in the section below.

SM1.3.2 Clinical Data Preprocessing
The original dataset size of clinical data is (11315, 141). First, an information-based filter is applied to filter out only the features that are known to be informative. Next are replaced with race = 'hispanic or latino'. Finally, the ethnicity column is removed. This is because we don't require both race and ethnicity columns.
• Next, all patients that don't have vital status recorded are filtered out.
• Next, patients that have both days_to_death and days_to_last_follow_up missing are removed as patients must have at least one of these recorded.
• Next, patients that are alive but are missing days_to_follow_up are removed.
Similarly, patients that are dead but are missing days_to_death are also removed.

SM2. Other Feature Reduction Techniques
As mentioned in the paper, a few other feature reduction and feature selection techniques were explored (with TabNet) on RNA-Seq data during the course of this study. However, none of them provided better results as compared to our three standard feature selection methods.

SM2.1 Laplacian Score
Laplacian Score is a feature selection technique that provides a score to each feature that reflects its locality-preserving power. Laplacian Score was used as a feature reduction technique and selected a total of 74 features.

SM3.3 Multimodal Results with other models(apart from Tabnet):
The efficacy of self-supervised learning using multiple modalities of data were further experimented with the following set of models:

SM5. Hyperparameter Tuning
Optuna is the main hyperparameter tuning framework used in this study. It relies on setting the workflow as an objective function and the evaluation metric as a parameter to optimize. It is known to perform efficient hyperparameter search, can run in parallel and has a model-agnostic API.
Optuna was employed to work on TabNet due to the variety of hyperparameters that can be tuned on the model, namely:

SM5.1 Hyperparameter tuning for TabNet -RNA-Seq
As explained in the introduction to this section, Optuna was run on a total of 5 trials, which yields the following hyperparameters: The following parallel coordinate plot was obtained for the 5 trials that Optuna was used to tune the hyperparameters:

SM5.2 Hyperparameter tuning for TabNet -Multimodal
The following hyperparameters were obtained after Optuna completed 5 trials with the TabNet model on the multimodal data.

Hyperparameter Name Value
Learning Rate 0.057 The following parallel coordinate plot was obtained post hyperparameter tuning.
As explained earlier the parallel plot gives us a view of the hyperparameters used for each trial based on their objective value. It can be observed here that low values of momentum were seemingly preferred.
The corresponding hyperparameter importance plot was also obtained: It can be inferred that the number of epochs influences the objective value greatly. The patienceScheduler and lr have the next highest importance for the objective value. It is also observed that similar to unimodal TabNet, the sparsity-inducing regularization has the least impact on the objective value.